Decision Trees
Introduction
To continue with the previous analysis, we will build a decision tree model for the lithium companies financial information.
Below we can observe the libraries imported and the functions created that will be used in this section.
Libraries
# Import Data
import pandas as pd
# Data Cleaning
import numpy as np
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Split Dataset
from sklearn.model_selection import train_test_split
# Random Classifier
from collections import Counter
# Sklearn Decision Tree Model
from sklearn import tree
# Show Results
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
np.random.seed(5000)Function: random_classifier()
## RANDOM CLASSIFIER
def random_classifier(y_data):
ypred=[];
max_label=np.max(y_data); #print(max_label)
for i in range(0,len(y_data)):
ypred.append(int(np.floor((max_label+1)*np.random.uniform(0,1))))
print("-----RANDOM CLASSIFIER-----")
print("count of prediction:",Counter(ypred).values()) # counts the elements' frequency
print("probability of prediction:",np.fromiter(Counter(ypred).values(), dtype=float)/len(y_data)) # counts the elements' frequency
print("accuracy",accuracy_score(y_data, ypred))
print("percision, recall, fscore,",precision_recall_fscore_support(y_data, ypred))Function: confusion_plot()
def confusion_plot(actual, pred):
matrix = confusion_matrix(actual, pred)
TP = matrix[0,0]
FN = matrix[0,1]
FP = matrix[1,0]
TN = matrix[1,1]
ACCURACY = (TP+TN)/(TP+FP+FN+TN)
NEGATIVE_RECALL = TN/(TN+FP)
NEGATIVE_PRECISION = TN/(TN+FN)
POSITIVE_RECALL = TP/(TP+FN)
POSITIVE_PRECISION = TP/(TP+FP)
print('ACCURACY: ', ACCURACY)
print('NEGATIVE RECALL (Y=0): ', NEGATIVE_RECALL)
print('NEGATIVE PRECISION (Y=0): ', NEGATIVE_PRECISION)
print('POSITIVE RECALL (Y=1): ', POSITIVE_RECALL)
print('POSITIVE PRECISION (Y=1): ', POSITIVE_PRECISION)
print(matrix)
cm = confusion_matrix(actual, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()Function: plot_tree()
def plot_tree(model, X, Y):
model_X_Y = model.fit(X, Y)
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(model_X_Y,
feature_names=X.columns,
#class_names=list(str(Y.keys())),
filled=True)
plt.show()Methods
In Machine Learning, decision trees are classified as non-parametric supervised learning method used for both classification and regression models.
The goal of decision trees is to create a model that predicts the output value of the target variables based on different set of rules. This can be better understood as a set of questions where the first one, the root node, is a very general questions, and based on the answer the trajectory to make the next question. After several questions you can get to the final value.
To go from one inner node to another, there may be two possible paths, which we identify as binary split as it divides values into two subsets, or more tan two possible paths, which we identify as multi way split as it uses as many partitions as distinct values.
There are different types of decision trees, but in this analysis we will focus in binary classifier, where the output of each node is true or false.
Decision trees have several advantages such as that they are simple to understand and easy to interpret. This is due to the visualization of the tree that can be created in order to understand the logic and the conditions behind it. It also required minimal data preparation compared to other models. As it was explained before, they can handle both numerical and categorical data and combine them in a way to handle multi-outut problems.
One of the main disadvantages of the decision trees is that they tend to overfit, but there are several techniques that can be applied to reduce this, such as pruning or setting a maximum depth. The predictions done by the decision trees are local, this means that if we want to extrapolate to new values, they might not perform well. Lastly, we need to consider that if the dataset is not well balanced, the performance might be poor too because the results will tend to be biased.
Class distribution
The data that will be used for this analysis is the financial information for lithium companies. Considering the number of companies per specialization, we decided to filter out the dataset to keep the lithium producers and electric vehicle producers.
Original Data
Company KPI Year Quarter Value
0 AEHR Accounts Payable 2013.0 4.0 1892.0
1 AEHR Accounts Payable 2014.0 1.0 2265.0
2 AEHR Accounts Payable 2014.0 2.0 1892.0
3 AEHR Accounts Payable 2014.0 3.0 1586.0
4 AEHR Accounts Payable 2014.0 4.0 1790.0
(24874, 5)
Data Cleaning
After cleaning the data, we can see that we have all the financial information for the companies and 71 variables that will be used for the analysis, which all are in float format. The data has been reshaped to be in a wide format instead of a long format. Below there is a list of the companies by type of production.
Lithium Production: - ALB - LTGM - SGML - PLL
Electric Vehicles Production: - TSLA - F - LI - ON - RIVN - XPEV - LVWR - AEHR
{python}
df = df[(df['Year'] == 2022) & (df['Quarter'] == 4)]
df.drop(['Year','Quarter'], axis=1, inplace=True)
# - - -
# df['Month'] = df['Quarter'] * 3
# df['Day'] = 1
# df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
# df.drop(['Year','Quarter','Day', 'Month'], axis=1, inplace=True)
df = df[df['Company'].isin(['ALB', 'LTHM', 'SGML', 'PLL', 'TSLA', 'F', 'LI', 'ON', 'RIVN', 'XPEV', 'LVWR', 'AEHR'])]
df = df.pivot(index=['Company'], columns='KPI', values='Value')
# df = df.pivot(index=['Company', 'Date'], columns='KPI', values='Value')
df.reset_index(inplace=True)
# df['Date'] = pd.to_datetime(df['Date'])
# df.set_index('Date', inplace=True)
df.head(5)KPI Company Accounts Payable ... Total Long-Term Liabilities Working Capital
0 AEHR 3949.0 ... 130.0 54789.0
1 ALB 2052001.0 ... 4524660.0 2445902.0
2 F 25605000.0 ... 115851000.0 19610000.0
3 LTHM 81700.0 ... 482500.0 395300.0
4 LVWR 12788.0 ... 10562.0 267487.0
[5 rows x 72 columns]
{python}
df.shape(11, 72)
Data Exploration
{python}
summary = df.describe().transpose()
summary['dtypes'] = df.dtypes
summary = summary[['dtypes', 'min', 'mean', 'max']]
summary dtypes min mean max
KPI
Accounts Payable float64 1936.000 5.381028e+06 2.560500e+07
Book Value Per Share float64 1.520 1.760091e+01 6.813000e+01
Cash & Cash Equivalents float64 36584.000 1.402432e+07 8.279000e+07
Cash & Equivalents float64 18874.000 6.614173e+06 2.513400e+07
Cash Growth float64 0.008 9.495900e+00 9.841530e+01
... ... ... ... ...
Total Debt float64 616.000 1.470467e+07 1.389690e+08
Total Liabilities float64 10876.000 2.742564e+07 2.127170e+08
Total Long-Term Assets float64 2008.000 2.109271e+07 1.394080e+08
Total Long-Term Liabilities float64 130.000 1.332009e+07 1.158510e+08
Working Capital float64 54789.000 6.448504e+06 1.961000e+07
[71 rows x 4 columns]
The next step is to divide the target column from the original data set. The target column is binary and has a value of 1 for the lithium manufacturing companies and 0 for the electric vehicle manufacturing companies.
{python}
df['target'] = df['Company'].isin(['ALB', 'LTHM', 'SGML', 'PLL']).astype(int)
df = df.drop(columns=['Company'])
length_0 = len(df[df['target'] == 0])
length_1 = len(df[df['target'] == 1])
total = df.shape[0]
print('Number of points with target=0: ', length_0, length_0/total)Number of points with target=0: 7 0.6363636363636364
{python}
print('Number of points with target=1: ', length_1, length_1/total)Number of points with target=1: 4 0.36363636363636365
In the correlation plot below, we can see the Pearson correlation between all the variables. There are some that are highly positively correlated and some that don’t have any correlation at all.
The dividend per share variable is the only one that has a negative correlation with the rest of the variables.
{python}
corr = df.corr()
sns.set_theme(style="white")
f, ax = plt.subplots(figsize=(20, 20))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, cmap=cmap, vmin=-1, vmax=1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()Random Classifier Model
Before building a decision tree model, we decided to test a baseline model to compare results in the next steps. The baseline model that will be used is a random classifier, since there are no criteria for assigning labels. The goal of comparing the random classifier with the decision tree is to see if the performance of our model is better than the performance of assigning a random label.
We can see that we get an accuracy of about 0.518, which is very low.
{python}
target_column = 'target'
X = df.drop(columns=[target_column])
Y = df[target_column]
X_array = X
Y_array = Y
x_train, x_test, y_train, y_test = train_test_split(X_array, Y_array, test_size=0.2, random_state=0)
BINARY CLASS: NON UNIFORM LOAD
-----RANDOM CLASSIFIER-----
count of prediction: dict_values([6, 5])
probability of prediction: [0.54545455 0.45454545]
accuracy 0.5454545454545454
percision, recall, fscore, (array([0.66666667, 0.4 ]), array([0.57142857, 0.5 ]), array([0.61538462, 0.44444444]), array([7, 4]))
Decision Tree Model
Now that we have the random classifier model, we can build the decision tree model. To evaluate the results, we use the confusion matrix function that we built at the beginning of this section.
For the training set, we get very good results because the accuracy is equal to 1, so the other performance metrics, positive and negative recalls and precisions, are also equal to 1. For the test set, we get an accuracy of 0.66 approximately, which compared to the random classifier is performing better but it could be improved.
Train Results
ACCURACY: 1.0
NEGATIVE RECALL (Y=0): 1.0
NEGATIVE PRECISION (Y=0): 1.0
POSITIVE RECALL (Y=1): 1.0
POSITIVE PRECISION (Y=1): 1.0
[[4 0]
[0 4]]
Test Results
ACCURACY: 0.6666666666666666
NEGATIVE RECALL (Y=0): nan
NEGATIVE PRECISION (Y=0): 0.0
POSITIVE RECALL (Y=1): 0.6666666666666666
POSITIVE PRECISION (Y=1): 1.0
[[2 1]
[0 0]]
Tree Plot
{python}
plot_tree(tree.DecisionTreeClassifier(), x_train, y_train)Model tuning:
To get the best decision tree model, we can tune the hyperparameters of the model. There are two types of hyperparameters that can be tuned. First, the min_samples, which is the minimum number of samples needed to split an internal node. The second is the max_depth, which is the maximum depth of the tree.
Below, we apply the hypertuning process to the max_depth parameter. Considering the low number of data points for the analysis the test and train metric values vary through the number of layers in decision tree. Therefore, for this analysis we choose a maximum depth equal to 4.
{python}
test_results=[]
train_results=[]
for num_layer in range(1,20):
model = tree.DecisionTreeClassifier(max_depth=num_layer)
model = model.fit(x_train,y_train)
yp_train=model.predict(x_train)
yp_test=model.predict(x_test)
# print(y_pred.shape)
test_results.append([num_layer,accuracy_score(y_test, yp_test),recall_score(y_test, yp_test,pos_label=0),recall_score(y_test, yp_test,pos_label=1)])
train_results.append([num_layer,accuracy_score(y_train, yp_train),recall_score(y_train, yp_train,pos_label=0),recall_score(y_train, yp_train,pos_label=1)])
test_results = pd.DataFrame(test_results, columns= ("num_layer", "accuracy_score", "recall_score_0", "recall_score_1"))
train_results = pd.DataFrame(train_results, columns= ("num_layer", "accuracy_score", "recall_score_0", "recall_score_1"))
# ACCURACY
sns.scatterplot(data=test_results, x='num_layer', y='accuracy_score', label='Test Results', color='red')
sns.lineplot(data=test_results, x='num_layer', y='accuracy_score', color='red')
sns.scatterplot(data=train_results, x='num_layer', y='accuracy_score', label='Train Results', color='blue')
sns.lineplot(data=train_results, x='num_layer', y='accuracy_score', color='blue')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('ACCURACY (Y=0): Training (blue) and Test (red)')
plt.show(){python}
# RECALL (Y=0)
sns.scatterplot(data=test_results, x='num_layer', y='recall_score_0', label='Test Results', color='red')
sns.lineplot(data=test_results, x='num_layer', y='recall_score_0', color='red')
sns.scatterplot(data=train_results, x='num_layer', y='recall_score_0', label='Train Results', color='blue')
sns.lineplot(data=train_results, x='num_layer', y='recall_score_0', color='blue')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=0): Training (blue) and Test (red)')
plt.show(){python}
# RECALL (Y=1)
sns.scatterplot(data=test_results, x='num_layer', y='recall_score_1', label='Test Results', color='red')
sns.lineplot(data=test_results, x='num_layer', y='recall_score_1', color='red')
sns.scatterplot(data=train_results, x='num_layer', y='recall_score_1', label='Train Results', color='blue')
sns.lineplot(data=train_results, x='num_layer', y='recall_score_1', color='blue')
plt.xlabel('Number of layers in decision tree (max_depth)')
plt.ylabel('RECALL (Y=1): Training (blue) and Test (red)')
plt.show()Final results
Afer running the final results we can observe that the shape of the decision tree has changed, but the performance metric values are the same as the original model.
{python}
best_max_depth = 4
model = tree.DecisionTreeClassifier(max_depth=best_max_depth)
model = model.fit(x_train,y_train)
yp_train=model.predict(x_train)
yp_test=model.predict(x_test)
print("------TRAINING------")------TRAINING------
{python}
confusion_plot(y_train,yp_train)ACCURACY: 1.0
NEGATIVE RECALL (Y=0): 1.0
NEGATIVE PRECISION (Y=0): 1.0
POSITIVE RECALL (Y=1): 1.0
POSITIVE PRECISION (Y=1): 1.0
[[4 0]
[0 4]]
{python}
print("------TEST------")------TEST------
{python}
confusion_plot(y_test,yp_test)ACCURACY: 0.6666666666666666
NEGATIVE RECALL (Y=0): nan
NEGATIVE PRECISION (Y=0): 0.0
POSITIVE RECALL (Y=1): 0.6666666666666666
POSITIVE PRECISION (Y=1): 1.0
[[2 1]
[0 0]]
{python}
plot_tree(model,X,Y)Conclusions
As a conclusion, we can observe that the variables used to define the output of the tree are Effective Tax Rate and Total Assets. Considering the manufacturing activity of each company and how different they are, these variables are clear decision makers for the decision tree.
Anyway, for future analysis, additional companies and manufacturing activities have to be added in order to obtain better results and perform a multiclassifier decision tree.